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Abstract 

In this paper, a comprehensive performance review of a MPI-based high-order spec- 
tral and mortar element method C++ toolbox is presented. The focus is put on the 
performance evaluation of several aspects with a particular emphasis on the paral- 
lel efficiency. The performance evaluation is analyzed and compared to predictions 
given by a heuristic model, the so-called F model. A tailor-made CFD computation 
benchmark case is introduced and used to carry out this review, stressing the par- 
ticular interest for commodity clusters. Conclusions are drawn from this extensive 
series of analyses and modeling leading to specific recommendations concerning such 
toolbox development and parallel implementation. 
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1 Introduction 



This paper provides a detailed performance evaluation of the C++ toolbox 
named Speculoos (for Spectral Unstructured Elements Object-Oriented System). 
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Speculoos is a spectral and mortar element analysis toolbox for the numer- 
ical solution of partial differential equations and more particularly for solv- 
ing incompressible unsteady fluid flow problems [1]. The main architecture 
choices and the parallel implementation were elaborated and implemented by 
Van Kemenade and Dubois-Pelerin [2,3]. Subsequently, Speculoos' C++ code 
has been growing up with additional layers enabling to tackle and simulate 
more specific and arduous CFD problems: viscoelastic flows by Fietier and 
Deville [4-6], fluid-structure interaction problems by Bodard and Deville [7], 
large-eddy simulations of confined turbulent flows by Bouffanais et al. [8, 9] 
and free-surface flows by Bouffanais and Deville [10]. 

It is well known that spectral element methods are amenable easily to paral- 
lelization as they are intrinsically a natural way of decomposing a geometrical 
domain [11] and Chap. 8 of [12]. 

The numerous references previously given and the ongoing simulations based 
on Speculoos highlight the achieved versatility and flexibility of this C++ tool- 
box. Nevertheless, ten years have passed between the first version of Specu- 
loos' code and now, and tremendous changes have occurred at both hard- 
ware and software levels: fast dual DDR memory, RISC architectures, 64-bit 
memory addressing, compilers improvement, libraries optimization, libraries 
parallelization, increase in inter- connecting switch performance, etc. 

Back in 1995, Speculoos was commonly compiled and was running on HP, Sil- 
icon Graphics workstations and also on the Swiss- Tx machine, a commodity- 
technology based computer with enhanced interconnect link between proces- 
sors [13]. Currently most of the simulations based on Speculoos are compiled 
and are running on commodity clusters. The workstation world experienced 
a technical revolution with the advent of 'cheap' RISC processors leading to 
the ongoing impressive development of parallel architectures such as massively 
parallel clusters and commodity clusters. As a matter of fact, Speculoos bene- 
fited from this fast technical evolution as it was originally developed as to run 
in a single program, multiple data mode (SPMD) on a distributed-memory 
computer. The performance evaluations presented here are demonstrating the 
correlation between the good performances measured with Speculoos and the 
adequation of this code structure with the current hardware and software 
evolutions in parallel computing. 

This paper is organized as follows. In Section [2] we introduce the numerical 
context in which Speculoos was initiated, the software aspects related to its 
implementation and the variable-size benchmark test case used for the perfor- 
mance evaluation presented in the subsequent sections. Section [3] is devoted to 
the parallel performance analysis achieved on RISC-based commodity clusters. 
Finally, in Section H] we draw some conclusions on the results obtained. 
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2 Speculoos numerical and software context 

In this section, is gathered the necessary background information regarding the 
numerical method — namely the spectral and mortar element method — , the 
object-oriented concept and the parallel paradigm, essential roots embodied 
in Speculoos. The final Section 12.31 introduces the simulation used throughout 
this study as benchmark evaluation test case. 

2.1 Spectral and mortar element method 

The spectral element method (SEM) is a high-order spatial discretization 
method for the approximate Galerkin solution of partial differential equations 
expressed in weak forms. The SEM relies on expansions on Lagrangian inter- 
polants bases used in conjunction with particular Gauss-Lobatto and Gauss- 
Lobatto-Jacobi quadrature rules [14,15]. As high-order finite element tech- 
niques, the SEM can deal with arbitrary complex geometry where /i-refinement 
is achieved by increasing the number of spectral elements and p-refinement 
by increasing the Lagrangian polynomial order within the elements. From a 
high-order precision viewpoint, SEM is comparable to spectral methods as an 
exponential rate-of-convergence is observed when smooth solutions to regular 
problems are sought. 

C°-continuity across element interfaces requires the exact same interpolation 
in each and every spectral element sharing a common interface. The associ- 
ated caveat to such conforming configurations is the over-refinement meshing 
generated in low-gradient zones. The adopted remedy to such nuisance is a 
technique developed by Bernardi et al. [16] referred to as the mortar ele- 
ment method. Mortars can be viewed as variational patches of the discontinu- 
ous field along the element interfaces. They relax the C°-continuity condition 
while preserving exponential rate-of-convergence, and thus allow polynomial 
nonconformities along element interfaces. 

2.2 Parallel implementation 

The complexity and the size of the large three-dimensional problems tackled 
by numericists in their simulations require top computational performance ac- 
cessible from highly parallelized algorithms running on parallel architectures. 
As mentioned in [2], the implementation of concurrency in Speculoos was 
based on the concept that concurrency is a painful implementation constraint 
going against the high-level object-oriented programming concepts. As a mat- 
ter of consequence, Speculoos parallelization was kept very low-level. In most 



3 



higher-level operations parallelism does not even show up. 

From a computational viewpoint, systems discretized with a high-order spec- 
tral element method rely mainly on optimized tensor-product operations tak- 
ing place at the spectral element level. The natural data distribution for high- 
order spectral element methods is based on an elemental decomposition in 
which the spectral elements are distributed to the processors available for the 
run. It is worth noting that for very large computations, the number of spec- 
tral elements can become relatively important as compared to the number of 
processors available for the computation. The design of Speculoos makes it 
possible to have several elements sitting on a single processor. Nodal values 
on subdomain interface boundaries are stored redundantly on each proces- 
sor corresponding to the spectral elements having this interface in common. 
Moreover, this approach is consistent with the element-based storage scheme 
which minimizes the inter-processor communications. Inter-processor commu- 
nication is completed by MPI instructions [17]. 



2.3 Benchmark evaluation test case description 



As a common practice in performance evaluation, it is important to build 
a tailor-made benchmark based on a numerical simulation corresponding to 
a concrete situation. Before proceeding to the first step of our performance 
evaluation, we have short-listed some key parameters that have the most sig- 
nificant impact on the performance of our toolbox: single-processor optimiza- 
tion on the three computer architectures described in Tabled! single-processor 
profiling analysis, parallel implementation and scalability (including speedup, 
efficiency, communication times) and parallel implementation and processor 
dispatching. 

A test case has been developed for this benchmark and for the parallel bench- 
marking, see Sec. [31 This test case belongs to the field of CFD and consists 
in solving the Navier-Stokes equations for a viscous Newtonian incompress- 
ible fluid. Based on the problem at hand, it is always physically rewarding 
to non-dimensionalize the governing Navier-Stokes equations which take the 
following general form 

^ + u- Vu= -Vj9+— Au + f, V(x,t)efix/, (1) 

ot Ke 

V-u = 0, y{x,t)enxl, (2) 

where u is the velocity fleld, p the reduced pressure (normalized by the con- 
stant fluid density), f the body force per unit mass and Re the Reynolds 



4 



number expressed as 



in terms of the characteristic length L, the characteristic velocity U and the 
constant kinematic viscosity u. The system evolution is studied in the time 
interval I = [to, T]. From the physical viewpoint, Eqs. ([I])-® are derived from 
the conservation of momentum and the conservation of mass respectively. For 
incompressible viscous fluids, the conservation of mass also called continuity 
equation, enforces a divergence-free velocity field as expressed by Eq. (|2]). 
Considering particular flows, the governing Navier-Stokes equations ([I])-© 
are supplemented with appropriate boundary conditions for the fluid velocity 
u and/or for the local stress at the boundary. For time-dependent problems, 
a given divergence-free velocity field is required as initial condition in the 
internal fluid domain. 

All our computations were carried out using two time integrators: the implicit 
backward-differentiation formula (BDF) of order 2 for the treatment of the 
Stokes operator and an extrapolation scheme (EX) [18,19] of same order for the 
nonlinear convective term. One type of pressure decomposition mode, based 
on a fractional-step method using pressure correction namely BPl-PC [20-22] 
is used. 

Speculoos uses a Legendre SEM [12, 14, 15] for the spatial discretization of 
the Navier-Stokes equations. For the sake of simplicity the same polynomial 
order has been chosen in the different spatial directions {N^ = Ny = = N) . 
Moreover, to prevent any spurious oscillations in our Navier-Stokes compu- 
tations, the choice of a staggered Pat — PAr_2 interpolation method for the 
velocity and pressure respectively, has been made [12,23]. As a consequence 
of this choice of a staggered grid, the inner-element grid for the a;-, y- and 
2;-component of the velocity field is a Gauss-Lobatto-Legendre grid made up 
with (A^ -|- 1)^ quadrature (nodal) points and the grid for the pressure is a 
Gauss-Legendre grid made up with [(A^ — 2) -|- 1]^ quadrature (nodal) points, 
in each spectral element. 

The test case corresponds to the fully three-dimensional simulation of the 
flow enclosed in a lid-driven cubical cavity at the Reynolds number of 12 000 
placing us in the locally-turbulent regime. It corresponds to the case denoted 
under-resolved DNS (UDNS) in Bouffanais et al. [8,9]. The reader is referred 
to Bouffanais et al. [8, 9] for full details on the numerical method and on the 
parameters used throughout the present paper. 
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3 Pcirallel implementation 



In the sequel, we will assume that the reader is familiar with the basics of 
parameterization on a parallel machine. For a complete introduction to these 
notions we refer the reader to the following references [24,25]. 

The speedup S of an application on a given parallel machine can be described 
as 

^ Computing time on one processor Ti 

CPU plus communication times on P processors Tp + Tc 

If we suppose that the computing effort strictly scales with P, then Ti — PTp 
and the speedup can be written as 



Ti PTp 



Tp + Tc Tp + Tc l + 7m/7a 1 + 1/r' 

where 



(5) 



number of operations [MFlop] , . 

amount of data to transfer [MWord] ' 

is related to the application and 

effective processor performance [MFlops] , , 

effective communication bandwidth per processor [MWords] ' 

to the machine, and F = ■ja/jm- The reader is referred to [24] for full details 
on such parameterisation to tailor commodity clusters to applications. The 
efficiency £■ of a parallel machine is defined by 

E=- = (8) 
P 1 + 1/F ^ ^ 



3.1 Speculoos characteristics 



Speculoos uses a small amount of main memory. Parallelization is made in 
order to reduce the high overall computing time. The number of elements and 
the polynomial degrees in the three space directions are denoted by E^, Ey, 
and Ez, and A^^^, Ny, and A^^, respectively. The total number of independent 
variables per element is therefore x (A^^: + 1) x {Ny + 1) x {N^ + 1), where 
is the number of vector components per Gauss-Lobatto-Legendre (GLL) 
quadrature point. In addition, there are E^ x Ey x E^ elements. 



6 



3.2 Hardware and software used 



To perform the Speculoos code benchmark, the machines presented in Table [T] 
have been used. 



Name 


Manufact ur er 


CPU type 


Nodes 


Cores 


Interconnect 


Gele 


Cray 


Opteron DC 


16 


32 


SeaStar 


Pleiades 


Logics 


Pentium 4 


132 


132 


FE 


Pleiades2 


Dell 


Xeon 


120 


120 


GbE 


Pleiades2+ 


Dell 


Xeon 5150 


99 


396 


GbE 



Table 1 

Characteristics of the machines used for the benchmark. DC=Dual-Core. FE=Fast 
Ethernet. GbE=Gigabit Ethernet. 



As mentioned previously, the Speculoos code is written in C++, uses BLAS 
operations and implements the Message Passing Interface (MPI). 

The PAPl (Performance API) [26] available on the Cray XT3 machine was 
used to measure the number O of operations (in GFlops) and the MFlops rate 
of Speculoos. The VAMOS service available on the three Pleiades clusters [27] 
maps the hardware related data from the Ganglia monitoring tool [28] with 
the application and user related data (from cluster Resource Management 
System and Scheduler). We used the most aggressive optimization flag on all 
machines (-03 flag). 

3.3 Fixed problem size 

The first measurements are done on Pleiades2 with a fixed problem size, 
= Ey = E, = 8; = Ny = = 8; O = 155.4 GFlops, and varying the 
number P of processing elements from 1 to 32. The evolution of the runtime 
(for one time-step), the associated MFlops rate, and the efficiency E are given 
in Table [2j The speedup S' as a function of the number of processors is plotted 
in Fig. [H One observes that with 8 processors a speedup of 7 can be reached 
and a speedup of 30 with 45 processors. 

3.4 Increase CPU performance 

In this section, the number of processors on a Cray XT3 is kept fixed at 
the value P = 4. Then, we modify the polynomial degree and measure the 
MFlops rate. The MFlops rate performance metric for each process element 
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P GFlops Runtime (1 step) E 
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7 


4.101 


37.89 


0.92 


8 


5.590 


34.52 


0.88 


16 


8.346 


18.62 


0.82 


32 


14.179 


10.96 


0.70 



Table 2 

Evolution of GFlops rate and runtime for fourth time-step. E: Efficiency. 

is shown on Table [31 It increases as the problem size increases. As expected, 
one deduces that there is a limit on the number of processors that should be 
used in parallel. 



Ex — Ey — Ez 


Nx- 


-Ny- 




MFlops 


Walltime 


8-8-8 


6 


- 6 - 


6 


1624 


18.54 


8-8-8 


7 


-7- 


7 


2580 


29.79 


8-8-8 


8 


- 8 - 


8 


3100 


50.07 


8-8-8 


9 


- 9 - 


9 


3700 


83.12 


8-8-8 


10 - 


- 10 - 


- 10 


4150 


146.97 


8-8-8 


11 - 


- 11 - 


- 11 


4390 


257.36 



Table 3 

Evolution of MFlops rate and runtime for one time-step on 4 Cray XT3 dual-CPU 
nodes as a function of the polynomial degree. 

3.5 Varying the number of processing element P with problem size 

A more common way to measure scalability, and to overcome Amdahl's law, is 
to fix the problem size per processor and to increase the number of processors 
with the overall problem size. In other words, one tries to fix F that measures 
the ratio between processor needs over communication needs. We show in 
TableHlthe scalability of Speculoos on the Pleiades2+ cluster. It was compiled 
using MPICH2 and ice C++ compiler version 9.1e. 
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Fig. 1. Speedup of Speculoos code on the Pleiades2 (Xeon CPU). 
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Table 4 

Scalability of Speculoos. Same polynomial degree, same number of elements on each 
computing node on Pleiades2+ (Woodcrest) cluster. (A): with 4 MPI threads per 
node. (B): with npernode = 2 , two MPI threads per node. 

Table H] (A) shows results obtained when all the 4 cores are active for P > 1. 
Note that one Woodcrest node with 2 dual-core processors (Table H]) is slightly 
faster than 4 dual-CPU nodes (Table [3]) of the Cray XT3. When increasing the 
number of nodes with the problem size, the MFlops rate per core remains the 
same for all the cases. At this point, it is legitimate to determine if Speculoos 
is memory or processor bound. To find out, all the test cases in Table H] have 
been resubmitted to the Woodcrest nodes, first (A) using all the 4 cores per 
node, then (B) restricting to two the maximal number of MPI threads per 
node. Thus, instead of 16 nodes, 32 nodes were used to run the 64-processor 
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case (see Table H] (B)). One sees that the overall CPU time has been reduced 
by 20%, but the number of nodes was doubled. This shows that Speculoos 
includes parts that are processor bound and others that are memory bound. 
As a consequence, using all 4 cores does not give a two fold speedup (as one 
expects for a processor bound program) but neither the speedup is zero (as 
for a main memory bound application). Therefore, it is always more efficient 
to run Speculoos on all the 4 cores per node. 

3.6 CPU usage and the V model 

CPU usage has been monitored by the VAMOS monitoring service [27] avail- 
able on the Pleiades clusters. It provides information on the application's 
behavior. The higher the CPU usage is, the better the machine fits to the ap- 
plication. To perform that monitoring we took the same problem size {E^ = 
Ey = Ez = 8 and = Ny = = 8) during the same computing dura- 
tion (10 hours = 36'000 seconds). The application is run for 10 hours and 
the number of time-steps performed during this time is counted. With such a 
methodology, we ensure that each sample can perform a maximum of calcu- 
lations in a given amount of time. It is equivalent to set the same number of 
iteration for each sample and to measure the walltime. 

Figure [2] shows the different behavior of Speculoos on the three different 
Pleiades architectures. The F value — introduced in Eq. and, which re- 
flects the "fitness" of a given application on a given machine [24] — is also 
computed. Results are reported in Table [51 

Using the notations introduced earlier, T, Tp, Tc, and denote the total 
walltime, the CPU time for P processing elements, the time to communicate, 
and the latency time per iteration step, respectively. Then, 

T = Tp + Tc + Tl, (9) 

and the parameter F is easily expressed as 

It is possible to measure the total time T by means of an interpretation of 
the CPU usage plots (see Fig. [21). Indeed, the middleware Ganglia determines 
for every time interval of 20 seconds the average CPU usage (or efficiency E) 
for each processing element. This information has to be put into relation to 
the Speculoos apphcation. This is done via the middleware VAMOS. In the 
plots in Fig. [21, are added up all the values of E that lie in between x and 
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CPU Usage for Job 149812 (produced by VHHUS} 
25BBB I 1 1 1 1 1 1 1 




IB 20 30 'lO 50 GO 70 00 90 ISO 

effieiency in % 

CPU Usage for Job 139316 (produced by VHHBS} 
25BBB I 1 1 1 1 1 1 1 |i 1 n 




IB 20 30 'lO 50 GO 70 00 90 ISO 



Fig. 2. CPU Usage of Speculoos on different machines. Top: Pleiades cluster (CPU 
usage average 51.05% ,r = 1.04). Middle: Pleiades2 cluster (CPU usage average 
= 79.24%, r = 3.81). Bottom: Pleiades2+ cluster (CPU usage average 61.6%, 

r = 1.60). 
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T [s] r b [MB/s] W [words] Tp [s] Tc [s] Tl [s] 



Pleiades 


23.01* 


1.44* 


12* 


180 * 10^ 


13.58 


8.43 


1 


Pleiades2 


9.55* 


3.81* 


101 


180 * 10^ 


7.56 


0.98 


1 


Pleiades2+ 


12.89* 


1.60* 


101 


180 * 10*5 


7.93 


3.96 


1 



Table 5 

Measured (*) and computed quantities using the F model. 

X + 0.01, where x is the percentile represented on the abscissae of the plots. 
The efficiency E is related to the F through 

- ^ fin 



l-E 

What can also be estimated are the network bandwidths h of the GbE switch 
(between = 90 and 100 MB/s per link), the network bandwidth of the Fast 
Ethernet switch (between 6 = 10 and 12 MB/s per link) and the latency 
(L = 60 yus for both networks). First, a consistency test of those quantities 
is performed. Assuming that the Fast Ethernet switch has a fix bandwidth of 
hi = 12 MB/s, and for the GbE switch 62 = c^bi, with a unknown. Another 
unknown is the number of words W that is sent per node to the other nodes, 
and Tc = W/b. Based on the previous assumptions, the three F values for the 
three machines and the two networks is expressed as 

T 

= W^i + T,' ^^^^ 



W/b2 + TL 
W/b2 + TL 



(13) 
(14) 



These constitute a set of three equations for three unknown variables, namely 
W, a, and Tl. Solving for these variables leads to = 1, = 180 MWords, 
and a = 8.43. The value of 62 = 101 MB/s corresponds precisely to the one 
measured. This means that the model is well applicable. 



3. 7 Modification of the number of running threads per SMP node 



To study if Speculoos is dominated by inter-node communications. Figure El 
shows the result of two runs of the same problem size [E^ = Ey = E^ = 8 and 
Nx = Ny = Nz = 8) made respectively on 4 and 8 Woodcrest nodes during the 
same period of time (Ih = 3600 seconds) and counting the number of iteration 
steps. The ffist sample was launched forcing 2 MPI threads on each node and 
the second with 4 MPI threads on each node. 
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Fig. 3. CPU Usage on the 5100-series SMP node of Pleiades2+ cluster. 16 process- 
ing elements were required. 8 nodes/2 cores with 2 MPI threads per nodes in the 
upper case, 4 nodes/4 cores with 4 MPI threads per node in the lower case. 

We have to note that the CPU usage (system+user+nice) monitored by Gan- 
glia is the sum of all the process elements. For instance, for a dual-processor 
machine, when Ganglia measures 50% CPU usage, it means that each pro- 
cessor run at 100%. In Figure [3], when 2 MPI threads are blocked per node, 
we get a CPU usage of 51.13% while 157 iteration loops have been performed 
during one hour; when 4 MPI threads run on each node, we get a CPU us- 
age of 87.25% while only 117 iteration loops have been performed during one 
hour. Thus, the real CPU usage for the sample with 2 MPI threads per node 
is above 100% (2 cores are unused). 



4 Conclusions 



The extensive performance review presented in this paper for the high-order 
spectral and mortar element method C++ toolbox, Speculoos, has shown that 
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good performances can be achieved even with relatively common internode 
network communication systems, available software and hardware resources — 
small commodity clusters with non-proprietary compilers installed on it. 

We can conclude that the main implementation choices made a decade ago re- 
veal their promises. Even though those choices could have been questionable 
ten years ago, they are now in line with the current trend in computer ar- 
chitecture developments with the generalization of commodity and massively 
parallel clusters. 

The parallel implementation of Speculoos based on MPI has shown to be 
efficient. Reasonable scalability and efficiency can be achieved on commodity 
clusters. The results support the original choices made in Speculoos parallel 
implementation by keeping it at a very low-level. 

One of the goal of this study was to estimate if Speculoos could run on a mas- 
sively parallel computer architecture comprising thousands of computational 
units, specifically on the IBM Blue Gene machine at EPFL with 4'096 dual 
processor units. The performance of one processor corresponds to approxi- 
mately half of the performance of one processor on the Pleiades commodity 
cluster. Each Blue Gene node has 512 MB of main memory. A block with 
4x4x4 elements and a polynomial degree of = 8 in each space direction 
takes 200 MB of main memory. In a first step, one block per node will run on 
one node. Later, Speculoos will be modified to accommodate one block per 
processor, i.e. two blocks per node. A 4'096 blocks Speculoos case would offer 
the opportunity to run very accurate simulations of turbulent flows with more 
than half a billion of unknowns. Such a case would well scale on the IBM Blue 
Gene solution. In fact, the point-to-point operations per node do not change 
with the number of nodes. The Gigabit- Ethernet network can well handle 
the corresponding communications. The all-reduce operations scale logarith- 
mically with the number of computational units. A special efficient Fat Tree 
network takes care of all multicast communications. As a consequence, large 
Speculoos cases will perfectly scale on EPFL's Blue Gene machine. 
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